Instructions:
In Week 5-2, we learned the Probabilistic PCA model proposed by Tipping and Bishop (1999).
Consider the model $$ x_i = \mu + \theta w_i + \sigma \varepsilon_i, $$ where $\theta \in \mathbb{R}^{p \times r}$ is the loadings matrix (i.e., $\theta = VD$), $w_i \overset{\text{i.i.d.}}{\sim} N(0, I_r)$, and $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, I_p)$. For now, let's assume $\mu = 0_p$ for simplicity.
The likelihood function can be written as $$ \mathcal{L}(\theta, \sigma^2 | x_1, \dots, x_n) = \prod_{i=1}^n \int N(x_i | \theta w_i, \sigma^2 I_p) N(w_i | 0, I_r) dw_i, $$ and its log-likelihood function is given by $$ \ell(\theta, \sigma^2 | x_1, \dots, x_n) = \sum_{i=1}^n \log \int N(x_i | \theta w_i, \sigma^2 I_p) N(w_i | 0, I_r) dw_i. $$ To derive the MLE, we will use the EM algorithm:
Derive the expectation-step (E-step) of the EM algorithm
Obtain the expressions of $\hat \theta$ and $\hat \sigma^2$ from the maximization step (M-step)
For $\theta$, we differentiate $Q$ with respect to $\theta$ and set the gradient to zero (using $E[w_i^T \theta^T \theta w_i] = \operatorname{tr}(\theta^T \theta\, E[w_i w_i^T])$): $$ \begin{aligned} \frac{\partial Q}{\partial \theta} &= -\frac{1}{2\sigma^2} \sum_{i = 1}^n \frac{\partial}{\partial \theta}\left(- 2 x_i^T\theta E[w_i] + \operatorname{tr}\!\left(\theta^T \theta\, E[w_i w_i^T]\right)\right) \\ &= -\frac{1}{2\sigma^2} \sum_{i = 1}^n \left(- 2 x_i E[w_i]^T + 2\,\theta\, E[w_i w_i^T]\right) \\ 0 &= \sum_{i = 1}^n x_i E[w_i]^T - \theta \sum_{i = 1}^n E[w_i w_i^T] \\ \hat\theta &= \left(\sum_{i = 1}^n x_i E[w_i]^T\right)\left(\sum_{i = 1}^n E[w_i w_i^T]\right)^{-1} \end{aligned} $$
For $\sigma^2$: $$ \begin{aligned} \frac{\partial Q}{\partial \sigma^2} &= \frac{\partial}{\partial \sigma^2} \left[ -\frac{1}{2\sigma^2} \sum_{i = 1}^n \left(x_i^T x_i - 2 x_i^T\theta E[w_i]+E[w_i^T \theta^T \theta w_i]\right) - \frac{np}{2}\log(2 \pi) - \frac{np}{2}\log(\sigma^2) - \frac{1}{2}\sum_{i = 1}^n E[w_i^Tw_i] - \frac{nr}{2} \log(2\pi) \right] \\ &= \frac{1}{2(\sigma^2)^{2}} \sum_{i = 1}^n \left(x_i^T x_i - 2 x_i^T\theta E[w_i]+E[w_i^T \theta^T \theta w_i]\right) - \frac{np}{2\sigma^2} \\ 0 &= \frac{1}{\sigma^2} \sum_{i = 1}^n \left(x_i^T x_i - 2 x_i^T\theta E[w_i]+E[w_i^T \theta^T \theta w_i]\right) - np \\ \hat\sigma^2 &= \frac{1}{np}\sum_{i = 1}^n \left(x_i^T x_i - 2 x_i^T\theta E[w_i]+E[w_i^T \theta^T \theta w_i]\right)\\ \end{aligned} $$
To implement the algorithm, we could run it multiple times with different initializations of the starting parameters. The algorithm splits into two parts, the E-step and the M-step, each implemented with its own loop following the expressions above. After each parameter update, we compare the log-likelihood under the updated parameters with that under the old ones; if the difference is smaller than some chosen tolerance $\epsilon$, we conclude that the algorithm has converged.
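The iteration above can be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes $\mu = 0$, the function name `ppca_em` and its defaults are my own choices, and the convergence check uses the marginal likelihood $x_i \sim N(0, \theta\theta^T + \sigma^2 I_p)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ppca_em(X, r, tol=1e-6, max_iter=500, seed=0):
    """EM for probabilistic PCA with mu = 0 (a sketch of the steps above)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((p, r))   # random initialization (one possible choice)
    sigma2 = 1.0
    ll_old = -np.inf
    for _ in range(max_iter):
        # E-step: posterior moments of w_i given x_i
        M = theta.T @ theta + sigma2 * np.eye(r)       # r x r
        Minv = np.linalg.inv(M)
        Ew = X @ theta @ Minv                          # row i is E[w_i] = M^{-1} theta^T x_i
        # sum_i E[w_i w_i^T] = n * sigma2 * M^{-1} + sum_i E[w_i] E[w_i]^T
        S_ww = n * sigma2 * Minv + Ew.T @ Ew
        # M-step: closed-form updates derived above
        theta = (X.T @ Ew) @ np.linalg.inv(S_ww)
        sigma2 = (np.sum(X**2) - 2 * np.sum((X @ theta) * Ew)
                  + np.trace(theta.T @ theta @ S_ww)) / (n * p)
        # Convergence check on the marginal log-likelihood
        C = theta @ theta.T + sigma2 * np.eye(p)
        ll = multivariate_normal.logpdf(X, mean=np.zeros(p), cov=C).sum()
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll
    return theta, sigma2
```

As noted above, one would typically rerun this with several random seeds and keep the run with the highest log-likelihood.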
In order to ensure orthogonality, we could add a constraint in the form of a Lagrange multiplier at the end of the maximization step.
We have discussed how to estimate the Gaussian mixture model using EM in class. In this question, we explore the Bayesian finite Gaussian mixture model using MCMC.
Recall the finite Gaussian mixture model with $K$ components: $$ p(\boldsymbol x | \boldsymbol \pi, \boldsymbol \mu, \boldsymbol \sigma^2) = \prod_{i=1}^n \sum_{k=1}^K \pi_k N(x_i | \mu_k, \sigma_k^2), $$ where $\boldsymbol x = (x_1, \dots, x_n)$, $\boldsymbol \pi = (\pi_1, \dots, \pi_K)$ is a vector of mixing proportions, $\boldsymbol \mu = (\mu_1, \dots, \mu_K)$ and $\boldsymbol \sigma^2 = (\sigma_1^2, \dots, \sigma_K^2)$ are means and variances of Gaussian densities (here, $\mu_k$ and $\sigma_k$ are scalars).
In the Bayesian setting, we introduce latent variables $\boldsymbol z = (z_1, \dots, z_n)$, where $z_i \in \{1, \dots, K\}$. Then, the model can be written as
\begin{align} & z_i | \boldsymbol \pi \sim \text{Multinomial}(\pi_1, \dots, \pi_K),\\ & x_i | z_i = k \sim N(x_i | \mu_k, \sigma^2_k). \end{align}We consider the following conjugate priors:
\begin{align} & \boldsymbol \pi \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_K),\\ & \mu_k|\sigma_k^2 \sim N(0, \rho_k \sigma_k^2),\\ & \sigma_k^{-2} \sim \text{Gamma}(a, b), \end{align}where $\alpha_1, \dots, \alpha_K, \rho_k, a, b$ are some hyperparameters which are fixed. Their values can be chosen by the user. For the Dirichlet distribution, please check its density function here.
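Under these conjugate priors, the full conditional distributions follow from standard normal-inverse-gamma results (a sketch of the updates; writing $n_k = \#\{i : z_i = k\}$, $\lambda_k = \rho_k^{-1} + n_k$, and $m_k = \lambda_k^{-1}\sum_{i: z_i = k} x_i$, and drawing $\sigma_k^2$ with $\mu_k$ integrated out, then $\mu_k$ given $\sigma_k^2$):
$$ \begin{aligned} & P(z_i = k \mid \boldsymbol \pi, \boldsymbol \mu, \boldsymbol \sigma^2, x_i) \propto \pi_k\, N(x_i \mid \mu_k, \sigma_k^2),\\ & \boldsymbol \pi \mid \boldsymbol z \sim \text{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_K + n_K),\\ & \sigma_k^{-2} \mid \boldsymbol z, \boldsymbol x \sim \text{Gamma}\!\left(a + \frac{n_k}{2},\; b + \frac{1}{2}\Big(\sum_{i : z_i = k} x_i^2 - \lambda_k m_k^2\Big)\right),\\ & \mu_k \mid \sigma_k^2, \boldsymbol z, \boldsymbol x \sim N\!\left(m_k,\; \frac{\sigma_k^2}{\lambda_k}\right). \end{aligned} $$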
We will develop an MCMC algorithm for the model:
MCMCGaussianFiniteMixture, for the MCMC algorithm (choose $K = 3$). The structure of your function should look like this:
MCMCGaussianFiniteMixture(x, T = 5000, K = 3, put those hyperparameters here)
# Note: T is the total iterations, K is the number of mixture components
{
Step 1: choose initial values for parameters to start the algorithm
Step 2: for t = 1, ..., T (total iteration) # start MCMC iterations
Step 2.1 draw z_1, ..., z_n
Step 2.2 draw \pi_1, ..., \pi_K
Step 2.3: for k = 1, ..., K:
Step 2.3.1 draw \sigma_k^{-2} (or \sigma_k^2)
Step 2.3.2 draw \mu_k
End the for loop
Collect values
End the for loop
Output: sample draws of mu, sigma, pi
}
To obtain draws from the Dirichlet distribution, use the function np.random.dirichlet
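A minimal sketch of a Gibbs sampler following the outline above. Assumptions worth flagging: the hyperparameter argument names (`alpha`, `rho`, `a`, `b`) and the `seed` argument are my own choices, `rng.dirichlet` is the `numpy.random.Generator` counterpart of `np.random.dirichlet`, and $\sigma_k^2$ is drawn with $\mu_k$ integrated out before drawing $\mu_k \mid \sigma_k^2$ (a valid blocked update under the conjugate priors).

```python
import numpy as np

def MCMCGaussianFiniteMixture(x, T=5000, K=3, alpha=1.0, rho=1.0, a=2.0, b=2.0, seed=0):
    """Gibbs sampler for the Bayesian finite Gaussian mixture (sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Step 1: initial values
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)       # K distinct data points
    sigma2 = np.full(K, x.var())
    mu_draws, sigma2_draws, pi_draws = [], [], []
    for t in range(T):                              # Step 2: MCMC iterations
        # Step 2.1: draw z_i with P(z_i = k | ...) proportional to pi_k N(x_i | mu_k, sigma_k^2)
        logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * sigma2)
                - (x[:, None] - mu) ** 2 / (2 * sigma2))        # n x K
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = (p.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)
        # Step 2.2: draw pi | z ~ Dirichlet(alpha + n_1, ..., alpha + n_K)
        counts = np.bincount(z, minlength=K)
        pi = rng.dirichlet(alpha + counts)
        # Step 2.3: conjugate updates for each component
        for k in range(K):
            xk = x[z == k]
            lam = 1.0 / rho + len(xk)               # posterior precision scale
            m = xk.sum() / lam                      # posterior mean of mu_k
            a_n = a + len(xk) / 2.0
            b_n = b + 0.5 * (np.sum(xk ** 2) - lam * m ** 2)
            # Step 2.3.1: numpy's gamma uses scale, so rate b_n becomes scale 1/b_n
            sigma2[k] = 1.0 / rng.gamma(a_n, 1.0 / b_n)
            # Step 2.3.2: draw mu_k given sigma_k^2
            mu[k] = rng.normal(m, np.sqrt(sigma2[k] / lam))
        # Collect values
        mu_draws.append(mu.copy())
        sigma2_draws.append(sigma2.copy())
        pi_draws.append(pi.copy())
    return np.array(mu_draws), np.array(sigma2_draws), np.array(pi_draws)
```

In practice one would discard an initial burn-in portion of the draws before computing posterior means and credible intervals.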
Apply the function MCMCGaussianFiniteMixture you wrote to the mouse.csv data (choose $K = 3$ and set the total iteration to be $5,000$ or larger. For hyperparameters, you can choose $\alpha_1, \dots, \alpha_K = 1, \rho_k = 1$, $a = 2, b = 2$; other values are possible as well). Compare your results with those from EM (e.g., the GaussianMixture package) and comment on your findings (e.g., are those values from EM close to their posterior means? Do the 95\% credible intervals include those values from EM?). Plot the clustered data using red color for Left Ear, green color for Right Ear, and blue for Head, and compare the plot with the one from EM (you can find the plot in the Week 5-2's lecture notes). Note: BayesianGaussianMixture in sklearn does not use the MCMC algorithm. It uses the so-called variational Bayes method, which has not been discussed yet.
The data set n90_pol.csv contains information on 90 British university students who participated in a psychological experiment designed to look for relationships between the size of different regions of the brain and political views.
The variables amygdala and acc indicate the volume of two particular brain regions known to be involved in emotions and decision-making, the amygdala and the anterior cingulate cortex; more exactly, these are residuals from the predicted volume, after adjusting for height, sex, and similar anatomical variables. The variable orientation gives the subjects’ locations on a five-point scale from 1 (very conservative) to 5 (very liberal). orientation is an ordinal but not a metric variable, so scores of 1 and 2 are not necessarily as far apart as scores of 2 and 3.
# 1
import pandas as pd
import numpy as np
from IPython.display import display
import scipy.stats as ss
import plotly.express as px
import plotly.io as pio
# This ensures Plotly output works in multiple places:
# plotly_mimetype: VS Code notebook UI
# notebook: "Jupyter: Export to HTML" command in VS Code
# See https://plotly.com/python/renderers/#multiple-renderers
pio.renderers.default = "plotly_mimetype+notebook"
# Load the data
oi_brits = pd.read_csv("n90_pol.csv")
# Displays
display(oi_brits)
# Create the density once, display its values at the data points, and print the bandwidth
amygdala_kde = ss.gaussian_kde(oi_brits.amygdala)
amygdala_density = amygdala_kde.pdf(oi_brits.amygdala)
display(amygdala_density)
display(px.scatter(x = oi_brits.amygdala, y = amygdala_density, marginal_x = "histogram"))
print("The bandwidth is", np.std(oi_brits.amygdala) * amygdala_kde.factor)
|   | subject | amygdala | acc | orientation |
|---|---|---|---|---|
| 0 | 1 | 0.0051 | -0.0286 | 2 |
| 1 | 2 | -0.0674 | 0.0007 | 3 |
| 2 | 3 | -0.0257 | -0.0110 | 3 |
| 3 | 4 | 0.0504 | -0.0167 | 2 |
| 4 | 5 | 0.0125 | -0.0005 | 5 |
| ... | ... | ... | ... | ... |
| 85 | 86 | 0.0174 | -0.0242 | 2 |
| 86 | 87 | 0.0251 | -0.0087 | 3 |
| 87 | 88 | 0.0676 | 0.0120 | 2 |
| 88 | 89 | -0.0097 | -0.0239 | 3 |
| 89 | 90 | 0.0374 | 0.0502 | 3 |
90 rows × 4 columns
array([12.19781554, 2.70213481, 7.68356675, 3.44448479, 11.03490713,
12.60454171, 7.54193218, 12.57932449, 11.43798903, 2.67465351,
3.81388624, 1.66957433, 12.5221318 , 11.74789421, 12.02985469,
7.7528476 , 5.70870721, 12.47813876, 12.47813876, 3.79260877,
12.42605667, 11.90162535, 3.67591483, 8.07699268, 10.77466016,
11.07070238, 11.96010531, 10.16116555, 3.77428528, 12.53381016,
4.88587911, 4.04302659, 1.61323491, 9.58962624, 10.32554169,
6.33905279, 12.38862504, 8.34598165, 8.71478225, 3.84088187,
11.63228635, 9.18855761, 10.17941665, 7.20715352, 12.42605667,
8.97565612, 11.06646711, 12.22701322, 6.94060474, 12.398562 ,
11.97397281, 11.15978304, 7.09443158, 8.1140476 , 11.31763454,
8.71478225, 12.24488676, 3.75648315, 12.39956902, 12.61569282,
5.84134239, 12.32446606, 12.16102937, 7.82623488, 8.50625022,
12.53978508, 3.40086792, 8.09690171, 12.59957911, 11.59241292,
7.29008954, 12.40828129, 12.60101378, 8.52535839, 8.40411473,
11.5721775 , 3.90915262, 3.23732196, 8.03705373, 6.65078144,
11.3399079 , 3.7444858 , 5.48892257, 8.15639009, 1.11126412,
10.14291818, 8.73357019, 1.80721617, 11.85725938, 6.15145847])
The bandwidth is 0.013183004308375648
# Creating the new density for acc
acc_kde = ss.gaussian_kde(oi_brits.acc)
acc_density = acc_kde.pdf(oi_brits.acc)  # evaluate the acc KDE at the acc values (not amygdala)
display(acc_density)
# Displaying
display(px.scatter(x = oi_brits.acc, y = acc_density, marginal_x = "histogram"))
print("The bandwidth is", np.std(oi_brits.acc) * acc_kde.factor)
array([1.55592973e+01, 9.73042096e-04, 1.07941930e+01, 2.87869209e+00,
1.18306724e+01, 1.99726092e+01, 1.04468712e+01, 2.03210456e+01,
1.28365274e+01, 8.90904886e-04, 7.85206738e-02, 1.88843695e-01,
1.77286979e+01, 2.07381334e+01, 1.47985372e+01, 5.22115442e+00,
5.63435870e+00, 2.08886723e+01, 2.08886723e+01, 6.77737741e-02,
1.69156862e+01, 1.42970616e+01, 3.43284976e-02, 5.70636719e+00,
1.85533730e+01, 1.19138632e+01, 2.10422533e+01, 9.95568766e+00,
6.01285848e-02, 2.06532397e+01, 3.36745804e+00, 5.67171055e-01,
1.38801349e-01, 1.54838379e+01, 1.02994084e+01, 7.36955838e+00,
2.10956882e+01, 1.24031026e+01, 6.87204741e+00, 9.60594372e-02,
2.05320044e+01, 1.44709591e+01, 9.99391830e+00, 9.61625855e+00,
1.69156862e+01, 7.41189335e+00, 1.92830410e+01, 2.12097390e+01,
4.32255272e+00, 2.10795618e+01, 1.45729334e+01, 1.95078801e+01,
4.46109512e+00, 1.18410156e+01, 1.25179743e+01, 6.87204741e+00,
1.58007878e+01, 3.04646008e+00, 1.67278219e+01, 1.93714835e+01,
6.00695038e+00, 2.11703629e+01, 1.53802862e+01, 1.11422633e+01,
6.46316271e+00, 1.79143509e+01, 1.03642868e-02, 5.73865350e+00,
2.00640243e+01, 2.04557408e+01, 9.82349997e+00, 2.10623995e+01,
1.88140544e+01, 6.49962537e+00, 1.25441292e+01, 1.32159729e+01,
1.75198526e-01, 2.72708399e+00, 5.64246024e+00, 4.09578109e+00,
1.99247634e+01, 5.00823442e-02, 3.52766681e+00, 5.83681667e+00,
6.42947441e-03, 9.91743888e+00, 6.90998622e+00, 3.69901016e-01,
2.09087611e+01, 3.79460104e+00])
The bandwidth is 0.008262433703070296
# Forming the joint density once and evaluating it at the data points
joint_kde = ss.gaussian_kde([oi_brits.amygdala, oi_brits.acc])
jointKDE = joint_kde.pdf([oi_brits.amygdala, oi_brits.acc])
print("The bandwidth for the amygdala is", joint_kde.factor * np.std(oi_brits.amygdala))
print("The bandwidth for the acc is", joint_kde.factor * np.std(oi_brits.acc))
The bandwidth for the amygdala is 0.015316368655217481
The bandwidth for the acc is 0.00959951750187294
The joint-density bandwidths differ from the marginal ones: scipy's Scott's rule factor, $n^{-1/(d+4)}$, depends on the dimension $d$ of the data, and, as we will explore further in the next question, the two variables are also not independent.
# Plotting
display(px.scatter_3d(oi_brits, x = "amygdala", y = "acc", z = jointKDE, title = "Joint Density"))
We can see from the joint-density plot that the KDE is not symmetric. Rotating the plot, we can see the distribution has some skew, suggesting that these variables are not independent. This dependence would also contribute to the joint-density bandwidths differing from those we obtained when visualizing the marginal distributions: if the variables were independent, the joint density would factor into the product of the marginals.